DCT-based scalable video compression

ABSTRACT

A method of encoding an input video signal for communication over a computer network, the method comprising the steps of: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation to coded macroblocks; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by encoding the motion information, encoding the significance map, encoding the signs of all significant coefficients, and encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a National Stage Application of International Application No. PCT/CA01/01517, filed Oct. 24, 2001, which claims the benefit under 35 USC 119(e) of U.S. Provisional Patent Application No. 60/242,938, filed Oct. 24, 2000, which provisional application is incorporated by reference in its entirety.

TECHNICAL FIELD

This invention relates generally to multimedia communications and more particularly to methods and systems for the compression/decompression and encoding/decoding of video information transmitted over the Internet.

BACKGROUND

With the availability of high-performance personal computers and the popularity of broadband Internet connections, the demand for Internet-based video applications such as video conferencing, video messaging, and video-on-demand is rapidly increasing. To reduce transmission and storage costs, improved bit-rate compression/decompression (“codec”) systems are needed. Image, video, and audio signals are amenable to compression due to considerable statistical redundancy in the signals. Within a single image or a single video frame, there exists significant correlation among neighboring samples, giving rise to what is generally termed “spatial correlation”. Also, in moving images, such as full motion video, there is significant correlation among samples in different segments of time such as successive frames. This correlation is generally referred to as “temporal correlation”. There is a need for an improved, cost-effective system and method that uses both spatial and temporal correlation to remove the redundancy in the video to achieve high compression in transmission and to maintain good to excellent image quality, while adapting to changes in the available bandwidth of the transmission channel and to the limitations of the receiving resources of the clients.

A known technique for taking advantage of the limited variation between frames of a motion video is motion-compensated image coding. In such coding, the current frame is predicted from the previously encoded frame using motion estimation and compensation, and only the difference between the actual current frame and the predicted current frame is coded. By coding only the difference, or residual, rather than the image frame itself, it is possible to improve image quality, for the residual tends to have lower amplitude than the image, and can thus be coded with greater accuracy. Motion estimation and compensation are discussed in Lim, J. S. Two-Dimensional Signal and Image Processing, Prentice Hall, pp. 497-507 (1990). However, motion estimation and compensation techniques have high computational cost, prohibiting software-only applications for most personal computers.

Further difficulties arise in the provision of a codec for an Internet streamer in that the bandwidth of the transmission channel is subject to change during transmission, and clients with varying receiver resources may join or leave the network as well during transmission. Internet streaming applications require video encoding technologies with features such as low delay, low complexity, scalable representation, and error resilience for effective video communications. The current standards and the state-of-the-art video coding technologies are proving to be insufficient to provide these features. Some of the developed standards (MPEG-1, MPEG-2) target non-interactive streaming applications. Although the H.323 Recommendation targets interactive audiovisual conferencing over unreliable packet networks (such as the Internet), the applied H.26x video codecs do not support all the features demanded by Internet-based applications. Although new standards such as H.263+ and MPEG-4 started to address some of these issues (scalability, error resilience, etc.), the current state of these standards is far from being complete in order to support a wide range of video applications effectively.

Due to very heterogeneous networking and computing infrastructure, highly scalable video coding algorithms are required. A video codec should provide reasonable quality to low-performance personal computers connected via a dial-up modem or a wireless connection, and high quality to high-performance computers connected using T1. Thus the compression algorithm is expected to scale well in terms of both computational cost and bandwidth requirement.

The Real-time Transport Protocol (RTP) is most commonly used to carry time-sensitive multimedia traffic over the Internet. Since RTP is built on the unreliable user datagram protocol (UDP), the coding algorithm must be able to effectively handle packet losses. Furthermore, due to the low-delay requirements of interactive applications and multicast transmission requirements, the popular retransmission method widely deployed over the Internet cannot be used. Thus the video codec should provide a high degree of resilience against network and transmission errors in order to minimize impact on visual quality.

Computational complexity of the encoding and decoding process must be low in order to provide reasonable frame rate and quality on low-performance computers (PDAs, hand-held computers, etc.) and high frame rate and quality on average personal computers. As mentioned, the popularly applied motion estimation and motion compensation techniques have high computational cost, prohibiting software-only applications for most personal computers.

SUMMARY OF INVENTION

This invention provides a new method of video encoding by performing reorganization, significance mapping, and bitstream layering of coefficients derived from discrete transform operations on original frames in which motion has been detected. Moving areas of a frame are updated using intracoding or differential coding based upon the difference between a current portion of a frame and the same portion from a previous frame. The previous frame can be either the previous original frame or a previous reconstructed frame. In order to increase error resilience, part of the frame can be periodically updated (intracoded) in a distributed manner, regardless of whether or not motion is detected. Discrete cosine transform (DCT) is carried out in sub-portions of each of the coded frame portions. The resulting coefficients are reorganized into a three-level multiple-resolution representation. The coefficients are then quantized into significant (one) and insignificant (zero) values, which determine a significance map. The map is rearranged by order of significance into bitstream layers that are encoded using adaptive arithmetic coding. The coding scheme provides scalability, high coding performance, and error resilience while incurring low coding delays and computational complexity.

BRIEF DESCRIPTION OF DRAWINGS

In Figures which illustrate non-limiting embodiments of the invention:

FIG. 1 is a functional block diagram of the encoder of a codec operated in accordance with an embodiment of the invention.

FIG. 2 is a functional block diagram of the encoder of a codec operated in accordance with an alternative embodiment of the invention, involving frame reconstruction at the encoder.

FIG. 3 is a pictorial representation illustrating how portions of a frame are periodically updated (intracoded) in a distributed manner, in accordance with an embodiment of the invention.

FIG. 4 is a pictorial representation illustrating the reorganization of DCT coefficients into a multi-resolution structure.

FIG. 5 is a pictorial representation illustrating the spatial scalability provided by a codec according to the invention.

FIG. 6 is a pictorial representation illustrating the temporal scalability provided by a codec according to the invention.

FIG. 7 is a pictorial representation illustrating a probability model for coding/transmitting significance information.

FIG. 8 is a pictorial representation illustrating a probability model for coding/transmitting the sign and magnitude of significant DCT coefficients.

DESCRIPTION

Throughout the following description, specific details are set forth in order to provide a more thorough understanding of the invention. However, the invention may be practiced without these particulars. In other instances, well known elements have not been shown or described in detail to avoid unnecessarily obscuring the present invention. Accordingly, the specification and drawings are to be regarded in an illustrative, rather than a restrictive, sense.

Each of FIG. 1 and FIG. 2 depicts a functional block diagram of the encoder of a codec according to embodiments of the invention. In each case, the encoding process consists of the following steps:

-   Motion detection;
-   Coefficients reorganization;
-   Coefficients partitioning;
-   Bitstream layering; and
-   Application of adaptive arithmetic coding.

Each of these steps will be explained in detail below. This process can be implemented using software alone, if desired.

FIG. 1 represents a simple encoding architecture according to the invention. It uses the previous original frame as the reference. The encoder does not maintain the state of the decoder, which means that error may accumulate between the original and reconstructed frames. Although error accumulation lowers the signal-to-noise ratio of the frames, a measure traditionally used for comparing objective performance, research has determined that the impact on visual quality is minimal. The codec architecture illustrated in FIG. 1 provides high error resilience and low computational complexity.

FIG. 2 represents a more sophisticated encoding architecture according to the invention, including a step of frame reconstruction at the encoder itself. In FIG. 2, the previously reconstructed frame is used as the reference for motion detection. Thus the encoder maintains the state of the decoder and the error accumulation (as in the previous structure) is avoided, resulting in higher objective performance (higher signal-to-noise ratio). However, this comes at the cost of higher computational complexity of the encoder and higher error sensitivity of the bitstream.

Motion Detection

Motion detection is not to be confused with the computationally complex “motion estimation and compensation” technique described above. Motion detection does not involve predicting/estimating the current frame based on a previous frame and encoding only the difference between the prediction/estimation and the actual frame. Rather, motion detection according to the present invention simply detects actual differences between the current frame and a previous frame, offering high error resilience and low computational complexity.

Motion detection is based on conditional replenishment; that is, only moving areas of the frame are updated using intracoding (an encoding process which does not use data from the reference frame) and/or differential coding (an encoding process where only the difference with the corresponding reference frame data is encoded). In one embodiment of the invention, each frame is divided into a two-dimensional array of macroblocks (MBs) each of size 16 pixels by 16 pixels, and this procedure is carried out for each MB separately. As illustrated in FIG. 1 and FIG. 2, motion detection may use either the previous original frame or the previous reconstructed frame as the reference.

When the previous original frame is used as reference, as in the embodiment illustrated in FIG. 1, each MB may have two states. If the difference between the current MB and the previous original MB is larger than a single given threshold, then the MB is coded, preferably by intracoding. Otherwise, the MB is not coded at all and the decoder replaces the pixels within the MB from the previous frame.

When the previous reconstructed frame is used as reference, as in the embodiment illustrated in FIG. 2, each MB may have three states. Two thresholds T₁ and T₂ (T₁<T₂) are defined. If the difference between the current MB and the previous reconstructed MB is smaller than T₁, then the MB is not coded. If the difference is larger than T₁ but smaller than T₂, then preferably the difference between the current MB and the previously reconstructed MB is coded (that is, by differential coding). Finally, if the difference is larger than T₂, then the MB is coded, preferably by intracoding.
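The two-state and three-state decisions described above may be sketched as follows. This is an illustrative sketch only: the difference measure (a sum of absolute differences), the function names, and the handling of a difference exactly equal to a threshold (taken here from the "does not exceed" wording of the claims) are assumptions, since the description does not fix them.

```python
from typing import List, Optional

Block = List[List[int]]  # one macroblock of samples, row-major


def block_difference(current: Block, reference: Block) -> int:
    """Sum of absolute differences (assumed metric; the text names none)."""
    return sum(
        abs(c - r)
        for row_c, row_r in zip(current, reference)
        for c, r in zip(row_c, row_r)
    )


def classify_macroblock(current: Block, reference: Block,
                        t1: int, t2: Optional[int] = None) -> str:
    """Return the coding state of one macroblock.

    With only t1 given, the two-state rule (previous original frame as
    reference) applies; with t1 < t2, the three-state rule (previous
    reconstructed frame as reference) applies.
    """
    d = block_difference(current, reference)
    if t2 is None:
        return "intracoded" if d > t1 else "not coded"
    if d <= t1:
        return "not coded"
    if d <= t2:
        return "differentially coded"
    return "intracoded"
```

For example, with hypothetical thresholds T₁ = 100 and T₂ = 1000, a macroblock whose samples all change by one grey level (difference 256) would be differentially coded, while one whose samples all change by thirty levels (difference 7680) would be intracoded.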

In order to increase error resilience, part of the frame may be periodically updated (intracoded) in a distributed manner, as illustrated in FIG. 3. Based on extensive experiments over the Internet, 10% of the frame is preferably intracoded regardless of motion; that is, the entire frame is fully intracoded in ten frames. This forced intraupdate results in significant performance gain in lossy networks such as the Internet compared to the traditional I-frame refresh applied in standard video codecs such as H.263 and MPEG-2.
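One possible schedule for the forced intraupdate is sketched below. The interleaved selection pattern, the function name, and the raster numbering of macroblocks are assumptions made for illustration; the actual distribution pattern is the one shown in FIG. 3, which is not reproduced here.

```python
from typing import List


def forced_intra_macroblocks(frame_index: int, mb_count: int,
                             period: int = 10) -> List[int]:
    """Indices of macroblocks intracoded in this frame regardless of motion.

    Every `period`-th macroblock is selected, with the starting offset
    advancing by one each frame, so that about 1/period of the frame is
    refreshed per frame and every macroblock is refreshed once per cycle.
    """
    phase = frame_index % period
    return [mb for mb in range(mb_count) if mb % period == phase]
```

For a 176×144 frame (11 × 9 = 99 macroblocks of 16×16 pixels), this schedule refreshes nine or ten macroblocks per frame and covers the whole frame in ten frames.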

DCT Coefficients Reorganization

After motion detection, each MB is preferably further partitioned into four nonoverlapping blocks each of size 8 pixels by 8 pixels. If the MB is not coded, then all the coefficients are replaced with zero coefficients. Otherwise, discrete cosine transform (DCT) is carried out on each of the blocks. After DCT, coefficients are reorganized into a three-level multi-resolution representation such as that shown in FIG. 4. Coefficients reorganization provides two advantages:

-   It increases error resilience significantly. The effects of packet losses will be uniformly distributed over the entire image instead of being concentrated into specific image regions, providing less disturbing visual artifacts.
-   Spatial scalability can be efficiently supported, as shown in FIG. 5.

DCT Coefficients Partitioning

After DCT transformation and data reorganization, coefficients are quantized with a uniform scalar quantizer. DCT coefficients that are quantized to nonzero are termed significant coefficients. DCT coefficients that are quantized to zero are insignificant coefficients. Thus the quantization procedure determines a map of significant coefficients which can be described as a significance map. The significance map is a binary image of size equal to that of the original image. Binary “0” means that the DCT coefficient at the corresponding location is insignificant, and binary “1” means that the DCT coefficient at the corresponding location is significant.
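The derivation of the significance map from the quantized coefficients may be sketched as follows. The step size and the round-to-nearest rule are assumptions chosen for illustration; the description specifies only that the quantizer is a uniform scalar quantizer.

```python
from typing import List, Tuple


def quantize(coefficient: float, step: float) -> int:
    """Uniform scalar quantizer: signed index of the nearest multiple of step."""
    magnitude = int(abs(coefficient) / step + 0.5)
    return magnitude if coefficient >= 0 else -magnitude


def significance_map(coefficients: List[List[float]],
                     step: float) -> Tuple[List[List[int]], List[List[int]]]:
    """Quantize every coefficient and mark the nonzero ones with binary 1."""
    quantized = [[quantize(c, step) for c in row] for row in coefficients]
    significant = [[1 if q != 0 else 0 for q in row] for row in quantized]
    return quantized, significant
```

With a hypothetical step size of 16, the coefficients 100.0 and -20.0 quantize to 6 and -1 and are significant, whereas -3.0 and 7.9 quantize to zero and are insignificant.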

Bitstream Layering

One of the key advantages of a codec according to this invention is its scalability in multiple dimensions. It provides three levels of temporal scalability and three levels of spatial scalability. FIG. 6 illustrates three levels of temporal scalability. In FIG. 6, “Q” means “quarter resolution”, “H” means “half resolution”, “F” means “full resolution”, and “GOF” means “group of frames”.

Take, for example, a video sequence encoded at 30 frames/second. The receiver is able to decode the video at 30 frames/second, 15 frames/second, or 7.5 frames/second. A codec according to the present invention also supports three levels of spatial scalability, as shown in FIG. 5. This means that if the original video is encoded at 352×288 pixel resolution, then it can be decoded at full spatial resolution (352×288 pixels), half spatial resolution (176×144 pixels), or quarter spatial resolution (88×72 pixels) as well.
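The relationship between the coefficient reorganization and the three spatial resolutions may be sketched as follows. The layout used here, a wavelet-style arrangement of ten subbands in which each 8×8 block contributes one 1×1, 2×2, or 4×4 patch to each subband, is an assumption; the exact structure is the one shown in FIG. 4, which is not reproduced here. The function names are likewise hypothetical.

```python
from typing import List

Grid = List[List[int]]
BLOCK = 8  # DCT block size used in the description

# (row offset, column offset, size) of each subband inside one 8x8 block:
# the DC term first, then three subbands at each of the three levels.
BANDS = [(0, 0, 1)] + [
    (r, c, s) for s in (1, 2, 4) for r, c in ((0, s), (s, 0), (s, s))
]


def reorganize(coeffs: Grid) -> Grid:
    """Gather like-frequency coefficients of all blocks into subbands."""
    rows, cols = len(coeffs), len(coeffs[0])
    nbr, nbc = rows // BLOCK, cols // BLOCK  # blocks per column / per row
    out = [[0] * cols for _ in range(rows)]
    for r0, c0, s in BANDS:
        for bi in range(nbr):
            for bj in range(nbc):
                for u in range(s):
                    for v in range(s):
                        src = coeffs[bi * BLOCK + r0 + u][bj * BLOCK + c0 + v]
                        out[r0 * nbr + bi * s + u][c0 * nbc + bj * s + v] = src
    return out


def low_resolution_part(reorganized: Grid, level: int) -> Grid:
    """Top-left region needed for level 0 (full), 1 (half), or 2 (quarter)."""
    rows = len(reorganized) >> level
    cols = len(reorganized[0]) >> level
    return [row[:cols] for row in reorganized[:rows]]
```

Under this assumed layout, a 352×288 frame yields an 88×72 top-left region holding the four lowest-frequency coefficients of every block, which is the data a quarter-resolution decoder would need, and a 176×144 region holding the sixteen lowest-frequency coefficients for half resolution.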

In addition to temporal and spatial scalability, SNR scalability is also supported by transmitting significant coefficients in bit-plane order starting from the most significant bit.

Adaptive Arithmetic Coding

Adaptive arithmetic coding is a four-stage procedure. First, the motion information is encoded. Second, the significance map is encoded. Third, the signs of all significant coefficients are encoded. Finally, the magnitudes of significant coefficients are encoded.

The motion information is encoded by using adaptive arithmetic coding. If the previous original frame is used as the reference, only two symbols are needed (“not coded” and “intracoded”). If the previous reconstructed frame is used as reference, then three symbols are usually needed (“not coded”, “differentially coded”, and “intracoded”).

The significance information of the DCT coefficients of the coded MBs needs to be encoded. The frame is scanned in “subband” order from coarse to fine resolution. Within each subband, the coefficients are scanned from top to bottom, left to right. It is observed that a large percentage (about 80%) of the DCT coefficients is insignificant. Binary adaptive arithmetic coding is used to encode the significance information of DCT coefficients. SIG and INSIG symbols denote significant and insignificant coefficients, respectively. During adaptive arithmetic coding, each pixel may be coded assuming a different probability model (each a “context”). In one embodiment, the context of each pixel is based on the significance status of its four neighboring pixels in its causal neighborhood as shown in FIG. 7. In FIG. 7, the causal neighborhood of a pixel is the set of four pixels (three pixels in the previous row and one pixel in the previous column) which are already encoded or decoded. Since the encoder and decoder shall use the same probability model during the coding process, the probability model of the current pixel can be based only on the knowledge of the already transmitted pixels. In this embodiment of the invention, a total of five probability models are used (0 significant, 1 significant, . . . , 4 significant).
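The context selection described above may be sketched as follows. The sketch is not an arithmetic coder: it keeps one adaptive frequency-count model per context and sums the ideal code length that an arithmetic coder driven by those models would approach. The count initialization, the treatment of neighbors outside the map as insignificant, the single raster scan in place of the subband-by-subband scan, and all names are assumptions made for illustration.

```python
import math
from typing import List


def significance_context(sig_map: List[List[int]], row: int, col: int) -> int:
    """Number of significant pixels (0 to 4) in the causal neighborhood:
    three pixels in the previous row and one in the previous column."""
    neighbours = ((row - 1, col - 1), (row - 1, col),
                  (row - 1, col + 1), (row, col - 1))
    rows, cols = len(sig_map), len(sig_map[0])
    return sum(sig_map[r][c] for r, c in neighbours
               if 0 <= r < rows and 0 <= c < cols)


class AdaptiveBinaryModel:
    """Frequency counts for the two symbols INSIG (0) and SIG (1)."""

    def __init__(self) -> None:
        self.counts = [1, 1]  # assumed initial counts

    def probability(self, symbol: int) -> float:
        return self.counts[symbol] / sum(self.counts)

    def update(self, symbol: int) -> None:
        self.counts[symbol] += 1


def estimate_bits(sig_map: List[List[int]]) -> float:
    """Ideal code length, in bits, of the map under five context models."""
    models = [AdaptiveBinaryModel() for _ in range(5)]
    bits = 0.0
    for r, row in enumerate(sig_map):
        for c, symbol in enumerate(row):
            model = models[significance_context(sig_map, r, c)]
            bits -= math.log2(model.probability(symbol))
            model.update(symbol)  # the decoder can mirror this update
    return bits
```

Because each model is updated only from symbols that have already been coded, a decoder running the same procedure would arrive at identical probabilities, which is the requirement stated above.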

The sign of significant DCT coefficients is also encoded using binary adaptive arithmetic coding. POS and NEG symbols denote positive and negative significant DCT coefficients, respectively. In this case, the context is determined by the number of significant coefficients in a small spatial neighborhood shown in FIG. 8. This context is calculated from the already transmitted significance map. In this embodiment of the invention, a total of five models are used (0 significant, 1 significant, . . . , 4 significant).

Magnitudes of significant DCT coefficients are encoded in bit-plane order. Again, a binary adaptive arithmetic codec with two symbols is used to encode each bit-plane. As with the sign of significant coefficients, the context is determined by the number of significant DCT coefficients in a small neighborhood such as that shown in FIG. 8. The transmission order proceeds from the most significant bit-plane to the least significant bit-plane, providing SNR scalability.
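The bit-plane ordering, and the way a truncated bitstream still yields a coarser reconstruction, may be sketched as follows. The function names are hypothetical, and the arithmetic coding of each plane is omitted; only the ordering is shown.

```python
from typing import List


def bit_planes(magnitudes: List[int]) -> List[List[int]]:
    """Split magnitudes into bit-planes, most significant plane first."""
    top = max(magnitudes, default=0).bit_length()
    return [[(m >> p) & 1 for m in magnitudes]
            for p in range(top - 1, -1, -1)]


def reconstruct(planes: List[List[int]], total_planes: int) -> List[int]:
    """Rebuild magnitudes from the leading planes received so far.

    Planes that were not received contribute zero bits, so decoding fewer
    planes gives a coarser approximation of every magnitude.
    """
    values = [0] * (len(planes[0]) if planes else 0)
    for i, plane in enumerate(planes):
        shift = total_planes - 1 - i
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values
```

For the magnitudes 6, 1, and 5, three planes are produced. A receiver holding only the first two planes would reconstruct 6, 0, and 4, and a receiver holding all three would recover the magnitudes exactly.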

In respect of the embodiment of the invention illustrated in FIG. 2, involving frame reconstruction, an inverse discrete cosine transform is carried out on the significance maps for the purpose of reconstructing each frame at the encoder.

A video codec according to the present invention provides the following highly desirable features:

-   Low coding delay: Since no frame buffering is needed, the codec provides the lowest possible coding delay. As soon as a frame arrives from the capture device, it is coded immediately.
-   Low computational complexity: The encoder and decoder of the codec have low and approximately symmetric computational complexity. This is because the method of the invention uses a simpler process of motion detection than the traditional process of motion estimation and compensation. The encoder has lower complexity than block-based motion estimation/compensation schemes by an order of magnitude. One implementation of the codec provides about 40 frames/second with a frame size of 176×144 pixels and with both the encoder and decoder executing on a 667 MHz Pentium III computer with a Windows 2000 operating system.
-   Scalable coding: The codec provides temporal, resolution, and SNR scalability. This enables a distributed video application such as video conferencing to dynamically adjust video resolution, frame rate, and picture quality of the received video depending on the available network bandwidth and hardware capabilities of the receiver. Receivers with high bandwidth network connections and high-performance computers may receive high quality color video at high frame rate while receivers with low bandwidth connections and low-performance computers may receive lower quality video at a lower frame rate.
-   Error resilience: The bitstream of the codec provides a high degree of error resilience for networks with packet loss such as the Internet. The encoding technique limits the effect of channel errors to the smallest possible temporal and spatial neighborhood. Undesirable temporal and spatial error propagation is prevented by using intraupdate and/or differential update instead of motion estimation/compensation. Furthermore, the applied coefficients reorganization and distributed forced intraupdate significantly increase the error resilience of the codec.
-   High coding performance: High coding performance is needed in order to provide users with high video quality over low bandwidth network connections. A more efficient compression algorithm means that users receive higher picture quality for a given bit rate. The codec provides similar visual quality when compared with H.263 for low motion scenes, and a slight compromise in visual quality for high motion scenes (due to lack of motion estimation and compensation). However, the performance compromise for high motion scenes is more than justified by the low coding delay, low computational complexity, scalable bitstream, and high error resilience.

As will be apparent to those skilled in the art in the light of the foregoing disclosure, many alterations and modifications are possible in the practice of this invention without departing from the scope thereof. Accordingly, the scope of the invention is to be construed in accordance with the substance defined by the following claims.

1. A method of encoding an input video signal comprising a group of video frames for communication over a computer network, the method comprising the steps of: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation (DCT) to coded macroblocks to produce DCT coefficients; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by a) encoding the motion information; b) encoding the significance map; c) encoding the signs of all significant coefficients; and d) encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.
2. The method of claim 1 wherein said previous frame is the previous original frame.
3. The method of claim 2 wherein the motion detection and coding step is such that: i) if the motion detected does not exceed a threshold, then said macroblock is not coded; and ii) if the motion detected exceeds said threshold, then said macroblock is intracoded.
4. The method of claim 1 further comprising the step of reconstructing each frame from the motion information and significance map, and wherein said previous frame is the previous reconstructed frame.
5. The method of claim 4 wherein the motion detection and coding step is such that: i) if the motion detected does not exceed a first threshold, then said macroblock is not coded; ii) if the motion detected exceeds said first threshold but does not exceed a second threshold, then the difference between said current frame and said previous frame is coded; and iii) if the motion detected exceeds both first and second thresholds, then said macroblock is intracoded.
6. The method of claim 1 further comprising the step, regardless of whether or not motion is detected, of periodically intracoding portions of each frame in a distributed manner.
7. A computer program product for encoding an input video signal comprising a group of video frames for communication over a computer network, said computer program product comprising: (A) a computer usable medium having computer readable program code means embodied in said medium for: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation (DCT) to coded macroblocks to produce DCT coefficients; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by a) encoding the motion information; b) encoding the significance map; c) encoding the signs of all significant coefficients; and d) encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.
8. The computer program product of claim 7 wherein said previous frame is the previous original frame.
9. The computer program product of claim 8 wherein the motion detection and coding means are such that: i) if the motion detected does not exceed a threshold, then said macroblock is not coded; and ii) if the motion detected exceeds said threshold, then said macroblock is intracoded.
10. The computer program product of claim 7 wherein said computer usable medium further has computer readable code means embodied in same medium for reconstructing each frame from the motion information and significance map, and wherein said previous frame is the previous reconstructed frame.
11. The computer program product of claim 10 wherein the motion detection and coding means are such that: i) if the motion detected does not exceed a first threshold, then said macroblock is not coded; ii) if the motion detected exceeds said first threshold but does not exceed a second threshold, then the difference between said current frame and said previous frame is coded; and iii) if the motion detected exceeds both first and second thresholds, then said macroblock is intracoded.
12. The computer program product of claim 7 wherein said computer usable medium further has computer program code means for, regardless of whether or not motion is detected, periodically intracoding portions of each frame in a distributed manner.